The purpose of this project is to examine the Hospital Consumer Assessment of Healthcare Providers and Systems (HCAHPS) survey and related healthcare quality and patient safety metrics that capture patients’ overall experience at a hospital.

As part of the Affordable Care Act, the government created a national, standardized, publicly reported survey of the patient’s experience by hospital (i.e. the HCAHPS). The HCAHPS contains qualitative measures of a patient’s experience, including cleanliness of the hospital, pain management, effective communication by doctors and nurses, etc. HCAHPS scores are based on surveys sent by the Centers for Medicare & Medicaid Services (CMS) to a random sample of patients. The survey responses are adjusted for a number of factors, including demographics and hospital location. CMS publishes a star rating for each of the survey questions and for an overall score; the highest rating a hospital can get is 5 stars and the lowest is 1 star. The survey questions center on patient satisfaction and quality of service in the hospital, such as: How well did your doctor communicate with you? Did the staff explain the medications you were given? Was your pain well managed? Was it quiet?

Our project will create a predictive model of star ratings for NY hospitals. First, we will look at other metrics within HCAHPS that can be leveraged within our model. Then, we will explore other hospital/Medicare databases and see whether metrics such as cost, mortality rates, infection rates, readmission rates, and waiting times are also predictive of star ratings.

Load the files from the GitHub repo:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(gplots)
## Warning: package 'gplots' was built under R version 3.2.4
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(ggplot2)
library(broom)
library(knitr)
library(caret)
## Loading required package: lattice
#cost sets
paymentandvalue <- read.csv("https://raw.githubusercontent.com/anniey/Hospital-Ranking-CS107/master/Raw%20Data/Payment_and_value_of_care_-_Hospital.csv") #national
inpatientcost <- read.csv("https://raw.githubusercontent.com/anniey/Hospital-Ranking-CS107/master/Raw%20Data/Hospital_Inpatient_Cost_Transparency__Beginning_2009.csv") #NYS

#risk of badness sets
PCIbyhospital <- read.csv("https://raw.githubusercontent.com/anniey/Hospital-Ranking-CS107/master/Raw%20Data/Cardiac_Surgery_and_Percutaneous_Coronary_Interventions_by_Hospital___Beginning_2008.csv") #NYS

complicationshospital <- read.csv("https://raw.githubusercontent.com/anniey/Hospital-Ranking-CS107/master/Raw%20Data/Readmissions_Complications_and_Deaths_-_Hospital.csv") #national

#service
hcahps <- read.csv("https://raw.githubusercontent.com/anniey/Hospital-Ranking-CS107/master/Raw%20Data/hcahps.csv") #national
timely <- read.csv("https://raw.githubusercontent.com/anniey/Hospital-Ranking-CS107/master/Raw%20Data/Timely_and_Effective_Care_-_Hospital.csv") #national

#readmissions
readmission <- read.csv("https://raw.githubusercontent.com/anniey/Hospital-Ranking-CS107/master/Raw%20Data/Hospital_Readmissions_Reduction_Program.csv") #national 

#infections
hospitalinfection <- read.csv("https://raw.githubusercontent.com/anniey/Hospital-Ranking-CS107/master/Raw%20Data/Hospital-Acquired_Infections__Beginning_2008.csv") #ny

Wrangling

Our datasets are publicly available from either New York State or the federal government. The two sources use different naming conventions and don’t share a common identifier for hospitals, so in order to get comparable data from the two sources we need to wrangle the data.

First, let’s limit the national datasets to New York State only:

paymentandvalue <- paymentandvalue %>% filter(State=="NY")
complicationshospital <- complicationshospital %>% filter(State=="NY")
hcahps <- hcahps %>% filter(State=="NY")
timely <- timely %>% filter(State=="NY")
readmission <- readmission %>% filter(State=="NY")

Additional wrangling: converting factors to numeric
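The chunks below repeat the same idiom for converting a factor of number-like strings to numeric (the `as.character(levels(f))[f]` form is equivalent to `as.numeric(as.character(f))`), while also stripping `$` and `,` characters. A small helper keeps the intent clear; this is a sketch on toy data, and the name `factor_to_numeric` is our own, not from the datasets:

```r
# helper: convert a factor of number-like strings (possibly with $ or ,) to numeric
factor_to_numeric <- function(f) {
  as.numeric(gsub("[$,]", "", as.character(f)))
}

# toy example mimicking a charge column read in as a factor
charges <- factor(c("$1,200", "$950", "$30,100"))
factor_to_numeric(charges)  # 1200 950 30100
```

The same helper could replace each of the `sub('\\$', ...)` pipelines below.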

complicationshospital$Denominator <- as.numeric(as.character(complicationshospital$Denominator))
complicationshospital$Score <- as.numeric(as.character(complicationshospital$Score))

inpatientcost$Mean.Charge <- sub('\\$','', as.character(levels(inpatientcost$Mean.Charge))[inpatientcost$Mean.Charge]) %>%
  as.numeric()
inpatientcost$Median.Charge <- sub('\\$','', as.character(levels(inpatientcost$Median.Charge))[inpatientcost$Median.Charge]) %>%
  as.numeric()

paymentandvalue$Payment <- sub('\\$','', as.character(levels(paymentandvalue$Payment))[paymentandvalue$Payment]) %>%
  as.numeric()
paymentandvalue$Denominator <- sub('\\,','', as.character(levels(paymentandvalue$Denominator))[paymentandvalue$Denominator]) %>%
  as.numeric()

timely$Score <- as.character(levels(timely$Score))[timely$Score] %>%
  as.numeric()

#change provider ID to numeric
timely$Provider.ID <- as.character(levels(timely$Provider.ID))[timely$Provider.ID] %>%
  as.numeric()

complicationshospital$Provider.ID <- as.character(levels(complicationshospital$Provider.ID))[complicationshospital$Provider.ID] %>%
  as.numeric()

paymentandvalue$Provider.ID <- as.character(levels(paymentandvalue$Provider.ID))[paymentandvalue$Provider.ID] %>%
  as.numeric()

Because there are no common identifiers between the national and state data (national uses Provider ID and New York State uses Facility ID), we will have to perform our join using hospital names.
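Before falling back on manual renaming, approximate string matching can surface the near-misses between the two conventions. A sketch using base R’s `adist` on invented names (the real comparison would use the `Hospital.Name` columns from the two sources):

```r
# hypothetical name lists standing in for the two sources
facility_names <- c("ROME MEMORIAL HOSPITAL INC", "SCHUYLER HOSPITAL")
provider_names <- c("ROME MEMORIAL HOSPITAL, INC", "SCHUYLER HOSPITAL, INC")

# edit distance between every facility/provider name pair
d <- adist(facility_names, provider_names)

# for each facility name, its closest provider-side candidate
closest <- provider_names[apply(d, 1, which.min)]
data.frame(facility = facility_names, best_match = closest, dist = apply(d, 1, min))
```

Names with a small but nonzero distance are exactly the pairs worth inspecting by hand.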

#use hospital infection naming convention as that is the closest to the provider naming convention
facilityID_data <- hospitalinfection %>%
  filter(Facility.Id > 0, Year == 2014 ) %>%
  select(Facility.Id, Hospital.Name, Infections.observed) %>%
  group_by(Facility.Id, Hospital.Name) %>%
  summarise(Infections = sum(Infections.observed) )

#adjust col names of PCI
colnames(PCIbyhospital)[1] <- "Facility.Id"
colnames(PCIbyhospital)[6] <- "Year"

#only take risk adjusted mortality rate and the latest year
facilityID_data <- PCIbyhospital %>%
  filter(Year == 2011, Procedure == "All PCI") %>%
  select(c(Facility.Id, Number.of.Cases, Risk.Adjusted.Mortality.Rate)) %>%
  right_join(facilityID_data, by = "Facility.Id")


#clean case convention and column name
facilityID_data$Hospital.Name <- toupper(facilityID_data$Hospital.Name)

#use complicationshospital as the conventional naming
providerID_data <- complicationshospital %>%
  select(c(Provider.ID, Hospital.Name))

Now let’s try to join the ProviderID dataset with the FacilityID dataset using the hospital/facility name. First, we need to clean up some of the varying naming conventions between the two datasets. We will use ProviderID as the convention, given that it is the one used by HCAHPS.

#find the hospital names that are slightly different between the datasets

facilityID_data$Hospital.Name <- as.character(facilityID_data$Hospital.Name)
providerID_data$Hospital.Name <- as.character(providerID_data$Hospital.Name)

FacilityNames <- facilityID_data %>%
  select(Hospital.Name, Facility.Id) %>%
  as.data.frame() %>%
  unique()

HospitalNames <- providerID_data %>%
  select(Hospital.Name, Provider.ID) %>%
  as.data.frame() %>%
  unique()

#merge the tables and find where there are unmerged rows
ID_key <- merge(FacilityNames, HospitalNames, by = "Hospital.Name", all = TRUE)

Quickly scanning the list of merged names, we notice that a common difference in convention between the datasets is the use of “INC” vs “, INC”. Let’s update FacilityID from “INC” to “, INC”.

#Update FacilityID's "INC" to match ProviderID's ", INC"
facilityID_data$Hospital.Name <- gsub(" INC", ", INC", facilityID_data$Hospital.Name)

#rerun the merged datasets
FacilityNames <- facilityID_data %>%
  select(Hospital.Name, Facility.Id) %>%
  as.data.frame() %>%
  unique()

ID_key <- merge(FacilityNames, HospitalNames, by = "Hospital.Name", all = TRUE)

It also looks like the facility database has separate IDs for different divisions of a hospital provider. Let’s remap all these divisions to the parent provider.

#rename Catskill to remove from further cleaning as we don't want to remove the - in this case
facilityID_data$Hospital.Name <- gsub("CATSKILL REGIONAL MEDICAL CENTER - G. HERMANN SITE", "CATSKILL REGIONAL MEDICAL CENTER G HERMANN SITE", facilityID_data$Hospital.Name)

#remove division name following -
facilityID_data$Hospital.Name <- gsub(" - [A-z ]*", "", facilityID_data$Hospital.Name)

#rerun the merged datasets
FacilityNames <- facilityID_data %>%
  select(Hospital.Name, Facility.Id) %>%
  as.data.frame() %>%
  unique()

ID_key <- merge(FacilityNames, HospitalNames, by = "Hospital.Name", all = TRUE)

We still have a couple more manual changes to make:

#create table remapping facility name to hospital name
facility_orig = c("ADIRONDACK MEDICAL CENTER-SARANAC LAKE SITE", "BROOKHAVEN MEMORIAL HOSPITAL MEDICAL CENTER INC", "BROOKLYN HOSPITAL CENTER AT DOWNTOWN CAMPUS", "CATSKILL REGIONAL MEDICAL CENTER - G. HERMANN SITE", "CHAMPLAIN VALLEY PHYSICIANS HOSPITAL MEDICAL CENTER", "CLIFTON-FINE HOSPITAL", "FAXTON-ST LUKES HEALTHCARE ST LUKES DIVISION", "FAXTON-ST LUKES HEALTHCARE FAXTON DIVISION", "JOHN T MATHER MEMORIAL HOSPITAL OF PORT JEFFERSON NEW YORK, INC", "MARGARETVILLE HOSPITAL", "MONTEFIORE", "MONTEFIORE MED CENTER", "MONTEFIORE MEDICAL CENTER& LUCY MOSES DIV", "MOUNT VERNON HOSPITAL", "MOUNT ST MARYS HOSPITAL AND HEALTH CENTER", "NEW YORK COMMUNITY HOSPITAL OF BROOKLYN,, INC", "NEW YORK PRESBYTERIAN HOSPITAL", "RIVER HOSPITAL,, INC.", "ROME MEMORIAL HOSPITAL,, INC", "SCHUYLER HOSPITAL", "SOLDIERS AND SAILORS MEMORIAL HOSPITAL OF YATES, INC", "ST JOHNS EPISCOPAL HOSPITAL SO SHORE", "ST JOSEPHS HOSPITAL", "ST JOSEPHS HOSPITAL HEALTH CENTER", "ST JOSEPH'S MC-ST VINCENTS WESTCHESTER DIVISION", "ST LUKE'S CORNWALL HOSPITAL/NEWBURGH", "ST LUKES ROOSEVELT HOSPITAL", "ST LUKES ROOSEVELT HOSPITAL CENTER", "ST PETERS HOSPITAL", "ST. JOSEPH HOSPITAL", "STATEN ISLAND UNIVERSITY HOSP-NORTH", "STATEN ISLAND UNIVERSITY HOSP-SOUTH", "UNITED HEALTH SERVICES HOSPITALS, INC.", "UNITED MEMORIAL MEDICAL CENTER BANK STREET CAMPUS", "UNITED MEMORIAL MEDICAL CENTER NORTH STREET CAMPUS", "UNIVERSITY HOSPITAL", "UNIVERSITY HOSPITAL OF BROOKLYN", "UNIVERSITY HOSPITAL SUNY HEALTH SCIENCE CENTER", "WOMANS CHRISTIAN ASSOC HOSPITAL", "WOODHULL MEDICAL & MENTAL HEALTH CENTER")

facility_new = c("ADIRONDACK MEDICAL CENTER", "BROOKHAVEN MEMORIAL HOSPITAL MEDICAL CENTER", "BROOKLYN HOSPITAL CENTER", "CATSKILL REGIONAL MEDICAL CENTER - G HERMANN SITE", "CHAMPLAIN VALLEY PHYSICIANS HOSPITAL MEDICAL CTR", "CLIFTON FINE HOSPITAL", "FAXTON-ST LUKE'S HEALTHCARE", "FAXTON-ST LUKE'S HEALTHCARE", "JOHN T MATHER MEMORIAL HOSPITAL OF PORT JEFFERSON", "MARGARETVILLE MEMORIAL HOSPITAL", "MONTEFIORE MEDICAL CENTER", "MONTEFIORE MEDICAL CENTER", "MONTEFIORE MEDICAL CENTER", "MONTEFIORE MOUNT VERNON HOSPITAL", "MOUNT ST MARY'S HOSPITAL AND HEALTH CENTER", "NEW YORK COMMUNITY HOSPITAL OF BROOKLYN, INC.", "NEW YORK-PRESBYTERIAN HOSPITAL", "RIVER HOSPITAL, INC", "ROME MEMORIAL HOSPITAL, INC", "SCHUYLER HOSPITAL, INC", "SOLDIERS AND SAILORS MEMORIAL HOSPITAL OF YATES", "ST JOHN'S EPISCOPAL HOSPITAL AT SOUTH SHORE", "ST JOSEPH'S HOSPITAL, INC", "ST JOSEPH'S HOSPITAL HEALTH CENTER", "ST JOSEPH'S MEDICAL CENTER", "ST LUKE'S CORNWALL HOSPITAL", "ST LUKE'S ROOSEVELT HOSPITAL","ST LUKE'S ROOSEVELT HOSPITAL CENTER", "ST PETER'S HOSPITAL", "ST JOSEPH HOSPITAL", "STATEN ISLAND UNIVERSITY HOSPITAL", "STATEN ISLAND UNIVERSITY HOSPITAL", "UNITED HEALTH SERVICES HOSPITALS, INC", "UNITED MEMORIAL MEDICAL CENTER", "UNITED MEMORIAL MEDICAL CENTER", "UNIVERSITY HOSPITAL (STONY BROOK)", "UNIVERSITY HOSPITAL OF BROOKLYN (DOWNSTATE)", "UNIVERSITY HOSPITAL S U N Y HEALTH SCIENCE CENTER", "WOMAN'S CHRISTIAN ASSOCIATION HOSPITAL", "WOODHULL MEDICAL AND MENTAL HEALTH CENTER")

#renaming table
remapped_names <- data.frame("Facility Name" = facility_orig, "Provider Name" = facility_new, stringsAsFactors = FALSE)

#rename the hospitals in the facilityID_data
facilityID_data$Hospital.Name <- sapply(facilityID_data$Hospital.Name, function(x){
  ifelse(x %in% remapped_names$Facility.Name, remapped_names[facility_orig == x, 2], x)
})

#rerun the merged datasets
FacilityNames <- facilityID_data %>%
  select(Hospital.Name, Facility.Id) %>%
  as.data.frame() %>%
  unique()

ID_key <- merge(FacilityNames, HospitalNames, by = "Hospital.Name", all = TRUE) %>%
  na.omit() %>%
  select(-c(Hospital.Name))

We now have a key matching facility ID to provider ID for around 130 hospitals. We can use this later when we investigate the relationship between metrics from the national datasets (e.g. HCAHPS, readmissions) and the New York datasets (e.g. infection rates, payments).
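With the key in place, linking a national metric to a state metric becomes a pair of joins through the two IDs. A sketch with toy rows; the column names mirror the real datasets but the values are invented:

```r
# toy stand-ins for ID_key and one metric from each source (invented values)
ID_key_toy   <- data.frame(Facility.Id = c(1, 2), Provider.ID = c(330101, 330102))
national_toy <- data.frame(Provider.ID = c(330101, 330102), Star.Rating = c(4, 3))
state_toy    <- data.frame(Facility.Id = c(1, 2), Infections = c(12, 30))

# join the national metric through the key to the state metric
linked <- merge(merge(ID_key_toy, national_toy, by = "Provider.ID"),
                state_toy, by = "Facility.Id")
linked
```

Inner joins through the key silently drop hospitals missing from either side, which is the behavior we want here.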

Part 1: Examining HCAHPS

For HCAHPS, we are concerned only with the linear scores and the star rating. Let’s keep only those metrics and spread the measures so that each measure ID is its own column.

#spread the ratings and the other scores
HCAHPS <- hcahps %>%
  select(Provider.ID, HCAHPS.Measure.ID, Patient.Survey.Star.Rating) %>%
  filter(HCAHPS.Measure.ID == "H_STAR_RATING") %>%
  spread(HCAHPS.Measure.ID, Patient.Survey.Star.Rating)

#only keep the aggregate metrics
HCAHPS <- hcahps %>%
  select(Provider.ID, HCAHPS.Measure.ID, HCAHPS.Linear.Mean.Value) %>%
  filter(HCAHPS.Measure.ID %in% c("H_CLEAN_LINEAR_SCORE", "H_COMP_1_LINEAR_SCORE", "H_COMP_2_LINEAR_SCORE","H_COMP_3_LINEAR_SCORE", "H_COMP_4_LINEAR_SCORE","H_COMP_5_LINEAR_SCORE", "H_COMP_6_LINEAR_SCORE","H_COMP_7_LINEAR_SCORE","H_HSP_RATING_LINEAR_SCORE", "H_QUIET_LINEAR_SCORE")) %>%
  right_join(HCAHPS, by = "Provider.ID")

#clean NAs
HCAHPS[HCAHPS == "Not Available"] = NA
HCAHPS <- na.omit(HCAHPS)

#transform classes
HCAHPS$HCAHPS.Linear.Mean.Value <- as.numeric(as.character(HCAHPS$HCAHPS.Linear.Mean.Value))

Let’s plot these scores and see if there is any relationship to star rating.

HCAHPS %>%
  ggplot(aes(H_STAR_RATING, HCAHPS.Linear.Mean.Value)) +
  geom_boxplot() +
  ylab("Linear Score") +
  xlab("Star Rating") +
  ggtitle("Star Rating vs. Linear Score of Other HCAHPS Metrics")

A quick look at the boxplot of aggregated scores against star ratings shows a strong positive relationship between the scores and the star rating. Now, let’s see which of the scores are most predictive of a hospital’s star rating.
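A complementary, simpler check than the standardized-distance heatmap computed below is the plain correlation of each score with the star rating. A sketch on an invented miniature of HCAHPS_sum (the values are made up, not from the survey):

```r
# invented miniature of HCAHPS_sum: star rating plus two of the linear scores
toy <- data.frame(
  Star_Rating = c(2, 3, 3, 4, 5),
  NurseComm   = c(85, 88, 89, 92, 95),
  Quietness   = c(80, 78, 84, 81, 83)
)

# correlation of every score with the star rating, strongest first
cors <- cor(toy)[, "Star_Rating"]
sort(cors[names(cors) != "Star_Rating"], decreasing = TRUE)
```

On the real data this ranking should broadly agree with the distance matrix, since both are computed on standardized columns.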

#find the distance between the scores and star rating
HCAHPS <- HCAHPS %>%
  spread(HCAHPS.Measure.ID, HCAHPS.Linear.Mean.Value) 

HCAHPS_sum <- HCAHPS %>%
  select(-c(Provider.ID))

#let's make the column names friendly
colnames(HCAHPS_sum) <- c("Star_Rating", "Cleanliness", "NurseComm", "DoctorComm", "StaffResponsiveness", "PainMmgt.", "MedicationComm.", "Discharge_Info", "Care_Transition", "Overall_Rating", "Quietness")

#turn the star rating into a number
HCAHPS_sum$Star_Rating <- as.numeric(as.character(HCAHPS_sum$Star_Rating))

#we have to standardize because star rating is on a different scale
HCAHPS_dist <- as.matrix(HCAHPS_sum) %>%
  scale() %>%
  t() %>%
  dist()

heatmap.2(as.matrix(HCAHPS_dist), 
          margins = c(10,10),
          cellnote= round(as.matrix(HCAHPS_dist),4), 
          notecex=1.0,
          notecol="cyan",
          dendrogram = 'none',
          trace='none',
          key = FALSE)

It looks like nurse communication is the closest to star rating, followed by Medication Communication.

Now, let’s see if we can build a knn model from these additional HCAHPS metrics. First, we will need to determine the best K to use.

set.seed(1)
control <- trainControl(method='cv', number=10, p=.7)

#we will need to center and scale since we are using different units
res_knn <- train(Star_Rating ~ .,
             data = HCAHPS_sum,
             method = "knn",
             trControl = control,
             preProcess = c("center","scale"),
             tuneLength = 1,
             tuneGrid=data.frame(k=seq(1,15,1)))

res_knn$results %>% 
  ggplot(aes(k, RMSE, ymin= RMSE - RMSESD, 
             ymax = RMSE + RMSESD)) +
  geom_point() + geom_errorbar() +
  ggtitle("RMSE over K")

res_knn
## k-Nearest Neighbors 
## 
## 154 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 139, 139, 138, 138, 139, 139, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   RMSE SD     Rsquared SD
##    1  0.3550570  0.8146416  0.16589106  0.11909805 
##    2  0.3249938  0.8633711  0.07835016  0.04686860 
##    3  0.3357165  0.8480199  0.07676584  0.05333952 
##    4  0.3309172  0.8510118  0.08068019  0.05460303 
##    5  0.3211934  0.8605531  0.06832507  0.05108204 
##    6  0.3123021  0.8703599  0.07591906  0.04591637 
##    7  0.3111902  0.8700788  0.07583596  0.05406197 
##    8  0.3055101  0.8743962  0.07269430  0.05395120 
##    9  0.3046925  0.8748050  0.07032935  0.05860115 
##   10  0.3107535  0.8724046  0.06738323  0.05499346 
##   11  0.3112200  0.8742018  0.06260791  0.04959820 
##   12  0.3176261  0.8710675  0.05717181  0.04746055 
##   13  0.3157013  0.8728423  0.05785688  0.04729927 
##   14  0.3219121  0.8690111  0.06167045  0.05196969 
##   15  0.3247508  0.8681348  0.05592365  0.04542577 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was k = 9.

From the RMSE graph above, it looks like k = 9 will give us the best predictions.

set.seed(1)
#let's create a model with k = 9
inTrain <- createDataPartition(y = HCAHPS_sum$Star_Rating , p = 0.8)
train_set <- slice(HCAHPS_sum, inTrain$Resample1)
test_set <- slice(HCAHPS_sum, -inTrain$Resample1)

#reset factors for test set
test_set$Star_Rating  <- factor(test_set$Star_Rating )

#fit the data
pred <- predict(res_knn, newdata = test_set)
tab <- table(pred = round(pred), truth = test_set$Star_Rating)

confusionMatrix(tab)$overall["Accuracy"]
##  Accuracy 
## 0.8965517

From the above, we see that the remaining metrics in HCAHPS have about 90% accuracy in predicting star ratings.

What happens when we try to replicate this 10 times?

knn_10 <- sapply(1:10, function(i){
  inTrain <- createDataPartition(y = HCAHPS_sum$Star_Rating , p = 0.8)
  train_set <- slice(HCAHPS_sum, inTrain$Resample1)
  test_set <- slice(HCAHPS_sum, -inTrain$Resample1)
  
  test_set$Star_Rating  <- factor(test_set$Star_Rating )
  
  pred <- predict(res_knn, newdata = test_set)
  
  #special treatment to catch corner cases where the number of stars in the prediction don't match the number of stars in the truth (i.e. there is a 4 star in the truth but not in the prediction)
  
  pred = round(pred)
  truth = test_set$Star_Rating
  u = union(pred, truth)
  tab <- table(factor(pred, u), factor(truth, u))
  
  confusionMatrix(tab)$overall["Accuracy"] 
})

mean(knn_10)
## [1] 0.9103448
sd(knn_10)
## [1] 0.04654818

We still have an average accuracy of around 90%, with a standard deviation of around 4%.
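One caveat on these replications: res_knn was tuned on the full dataset, so the held-out rows were already seen during tuning. A stricter version refits inside each split. A self-contained sketch of that pattern, using invented data and a hand-rolled knn regressor so it needs no extra packages; the data shape and k = 3 are placeholders, not the real HCAHPS setup:

```r
set.seed(1)

# toy regression data standing in for HCAHPS_sum (invented, not the real scores)
n <- 60
x <- runif(n)
dat <- data.frame(x = x, y = round(1 + 4 * x) + rnorm(n, sd = 0.1))

# hand-rolled knn regression: average the y of the k nearest training points
knn_predict <- function(train, test, k = 3) {
  sapply(test$x, function(x0) {
    idx <- order(abs(train$x - x0))[seq_len(k)]
    mean(train$y[idx])
  })
}

# the leakage-free pattern: split first, fit on the training rows only
accs <- sapply(1:10, function(i) {
  test_idx  <- sample(n, size = n %/% 5)
  train_set <- dat[-test_idx, ]
  test_set  <- dat[test_idx, ]
  pred <- round(knn_predict(train_set, test_set, k = 3))
  mean(pred == round(test_set$y))
})
mean(accs)
```

In caret, the equivalent fix is calling train() inside the replication loop on train_set only.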

Now, we will expand our search outside of HCAHPS and see whether metrics from other databases can help us improve our predictions. Before we link these new metrics to HCAHPS, we will first explore each dataset and examine some additional hypotheses about what impacts patient experience.

Data exploration of other potential predictors

As we have several other potential predictors in our other datasets, we will limit this preliminary data exploration to just those that have cost and mortality metrics, as these two are by conventional wisdom most likely to impact patient experience. We will examine the relationship that these two measures have with a critical third measure: the hospital case load. Our general hypothesis is that case load is a sign of experience treating a certain disorder. This may lead to better and more efficient treatment that will lower costs and increase survival.

Cost

We will start our cost examination with the inpatientcost dataset. This dataset allows us to look at cost while controlling for the severity of the injury/procedure. First, let us sanity-check this dataset and verify that cost is higher for more severe treatments.

Severity<- inpatientcost %>% 
  filter(APR.DRG.Description=="Heart Failure")

#reorder the factors from minor to extreme
Severity$APR.Severity.of.Illness.Description <- factor(Severity$APR.Severity.of.Illness.Description, levels = c("Minor", "Moderate", "Major", "Extreme"), ordered = TRUE)
  

#boxplot of charge by severity
Severity %>%
  ggplot(aes(APR.Severity.of.Illness.Description, Mean.Charge)) +
  geom_boxplot() +
  ylab("Mean Charge") +
  xlab("Severity of Illness")

#we will also check the median to avoid issues with skews
Severity %>%
  ggplot(aes(APR.Severity.of.Illness.Description, Median.Charge)) +
  geom_boxplot() +
  ylab("Median Charge") +
  xlab("Severity of Illness")

As expected, it appears extreme cases have higher charges than the other cases. That being said, interestingly enough, there is not much difference in cost between major and minor treatments.

Let’s see if there is a relationship between discharges and mean cost in this dataset. The idea behind this hypothesis is that hospitals that treat more patients for a certain procedure or disorder may have improved efficiencies that allow them to lower the treatment charge.

#plot of mean charge vs number of cases
Severity %>%
  ggplot(aes(Discharges, Mean.Charge)) +
  geom_point() +
  facet_wrap(~APR.Severity.of.Illness.Description, scales = "free_x")

#median
Severity %>%
  ggplot(aes(Discharges, Median.Charge)) +
  geom_point() +
  facet_wrap(~APR.Severity.of.Illness.Description, scales = "free_x")

Unfortunately, it looks like there is no relationship between the number of patients seen and the cost of treatment. This suggests that, contrary to expectation, hospitals that see more cases of a certain disorder/disease may not actually gain economies of scale that reduce costs.
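The visual impression of "no relationship" can be backed up with a correlation test. A sketch on invented discharge/charge pairs; with the real data one would pass Severity$Discharges and Severity$Mean.Charge to cor.test:

```r
set.seed(1)

# invented data: charges generated independently of discharge counts
discharges  <- rpois(40, lambda = 100)
mean_charge <- rnorm(40, mean = 40000, sd = 8000)

ct <- cor.test(discharges, mean_charge)
ct$estimate  # sample correlation
ct$p.value   # a large p-value means no evidence of a linear relationship
```

A large p-value here would be consistent with the flat scatterplots above.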

Mortality

We now move on to looking at the relationship between mortality and number of cases.

First, we will re-examine our earlier test of the relationship between cost and number of patients seen.

#clean the data
PV <- paymentandvalue %>%
  select(Provider.ID, Denominator, Payment, Payment.measure.ID, Value.of.care.category) %>%
  na.omit() %>%
  filter(Denominator != "Not Available")

#plot cost and number of cases a hospital sees
ggplot(PV, aes(Denominator, Payment)) + 
  geom_point() +
  facet_wrap(~Payment.measure.ID, scales = "free_x") +
  xlab("Case count")

Again, we see that there is not really a change in cost for a treatment with an increased number of cases seen. For pneumonia, we do see some normalization of cost to a little under $15,000 once the case count gets beyond 500.

Let’s now see whether value of care, as defined by mortality and payment intersection, is impacted by the number of cases seen.

#update the order
PV$Value.of.care.category <- factor(PV$Value.of.care.category, levels = c("Worse mortality and lower payment", "Worse mortality and average payment", "Worse mortality and higher payment", "Average mortality and lower payment", "Average mortality and average payment", "Average mortality and higher payment", "Better mortality and lower payment", "Better mortality and average payment", "Better mortality and higher payment"), ordered = TRUE)

#let's now check number of cases and value of care
PV %>%
  ggplot(aes(Value.of.care.category, Denominator)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("Number of Cases") +
  xlab("Value of Care") +
  ggtitle("Value of Care vs. Number of Cases")

#let's check heart disease in particular
PV %>% filter(Payment.measure.ID == "PAYM_30_HF") %>%
  ggplot(aes(Value.of.care.category, Denominator)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("Number of Cases") +
  xlab("Value of Care") +
  ggtitle("Value of Care vs. Number of Cases for Heart Failure")

From the boxplots, we can see that number of cases seen does not appear to vary much between average mortality and worse mortality outcomes. However, it does appear that hospitals with better mortality tend to have seen more cases.

Let’s focus on just mortality, ignoring the payment dimension, and observe the impact of cases there.

#bucket the mortality rates, ignoring payment
PV <- PV %>%
  mutate(Care.Cat = ifelse(grepl("Better mortality", Value.of.care.category), "Better Mortality", 
                           ifelse(grepl("Average mortality", Value.of.care.category), "Average Mortality", "Worse Mortality")))

#reorder
PV$Care.Cat <- factor(PV$Care.Cat, levels = c("Worse Mortality", "Average Mortality", "Better Mortality"), ordered = TRUE)

#let's plot a boxplot of the number of cases by group
PV %>%
  ggplot(aes(Care.Cat, Denominator)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("Number of Cases") +
  xlab("Mortality Rate") +
  facet_wrap(~Payment.measure.ID, scales = "free_x")

In ignoring the payment categories and focusing on just mortality, we can see that mortality does seem to be better at hospitals that observe more cases for all three types of illnesses. Let us see if there is enough of a relationship for us to predict mortality rates using number of cases seen by a hospital. We will focus on heart failure.

set.seed(1)

#turn character into numeric
PV_HF <- mutate(PV, Care.Cat = ifelse(grepl("Better Mortality", Care.Cat), 3, ifelse(grepl("Average Mortality", Care.Cat), 2, 1))) %>%
  filter(Payment.measure.ID == "PAYM_30_HF")


#create train and test sets
inTrain <- createDataPartition(y = PV_HF$Care.Cat, p = 0.8)
train_set <- slice(PV_HF, inTrain$Resample1)
test_set <- slice(PV_HF, -inTrain$Resample1)

#fit the data
fit <- glm(Care.Cat ~ Denominator, data=train_set)
pred <- predict(fit, newdata = test_set, type = "response")

pred = round(pred)
truth = test_set$Care.Cat
u = union(pred, truth)

tab <- table(factor(pred, u), factor(truth, u))

confusionMatrix(tab)$overall["Accuracy"]
##  Accuracy 
## 0.9333333

It looks like just using the number of cases is actually fairly predictive of the mortality rates for heart failure at hospitals. Let’s see if we can replicate this with a more common disorder like pneumonia:

set.seed(1)

#turn character into numeric
PV_PN <- mutate(PV, Care.Cat = ifelse(grepl("Better Mortality", Care.Cat), 3, ifelse(grepl("Average Mortality", Care.Cat), 2, 1))) %>%
  filter(Payment.measure.ID == "PAYM_30_PN")


#create train and test sets
inTrain <- createDataPartition(y = PV_PN$Care.Cat, p = 0.8)
train_set <- slice(PV_PN, inTrain$Resample1)
test_set <- slice(PV_PN, -inTrain$Resample1)

#fit the data
fit <- glm(Care.Cat ~ Denominator, data=train_set)
pred <- predict(fit, newdata = test_set, type = "response")
tab <- table(pred = round(pred), truth = test_set$Care.Cat)

tab
##     truth
## pred  1  2  3
##    2  1 28  2

From the above, we can see that number of cases seen does not have a strong predictive capability for mortality rates for pneumonia. So, the effect we observed for heart failure may just be limited to more complex diseases or diseases that may require more surgical intervention.

Let’s see what happens when we leverage another database that has actual mortality rates. We will focus on heart failure for now.

CH <- complicationshospital %>%
  select(Measure.ID, Denominator, Score)

#let's look at the relationship for HF only
CH %>%
  filter(Measure.ID == "MORT_30_HF") %>%
  ggplot(aes(Denominator, Score)) +
  geom_point() +
  geom_smooth() +
  ylab("Heart Failure Mortality Rate") +
  xlab("Number of Cases")

It looks like there is a slight increase in mortality rate when the number of cases increases, but then it drops off as the number of cases continues to grow. Let’s now explore additional measures, such as the mortality rates for AMI and stroke.

#let's add on the relationship for AMI and STK
CH %>%
  filter(Measure.ID %in% c("MORT_30_HF", "MORT_30_AMI", "MORT_30_STK")) %>%
  ggplot(aes(Denominator, Score, colour = Measure.ID)) +
  geom_point() +
  geom_smooth()  +
  ylab("Mortality Rate") +
  xlab("Number of Cases")

The drop in mortality rate is even more pronounced for AMI.

Let’s see what happens when we try to fit a regression to number of cases and mortality rate of AMI and HF.

CH_HF <- CH %>%
  filter(Measure.ID == "MORT_30_HF") %>%
  na.omit()
  
CH_AMI <- CH %>%
  filter(Measure.ID == "MORT_30_AMI") %>%
  na.omit()

res_HF <- CH %>%
  filter(Measure.ID %in% c("MORT_30_HF")) %>%
  lm(Score ~ Denominator, data = .) %>%
  tidy(conf.int = TRUE)

res_AMI <- CH %>%
  filter(Measure.ID %in% c("MORT_30_AMI")) %>%
  lm(Score ~ Denominator, data = .) %>%
  tidy(conf.int = TRUE)

kable(res_HF)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 12.0604626 0.1759339 68.551118 0.000000 11.7130270 12.4078982
Denominator -0.0014524 0.0003698 -3.927509 0.000127 -0.0021827 -0.0007221
kable(res_AMI)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 15.2156647 0.1626288 93.56068 0e+00 14.894139 15.5371906
Denominator -0.0031339 0.0006045 -5.18433 7e-07 -0.004329 -0.0019388

From the above, we can see that every additional 100 HF cases is associated with a decrease in a hospital’s mortality rate of roughly 0.07 to 0.22 percentage points (95% CI). For AMI, the decrease is larger: roughly 0.19 to 0.43 percentage points per additional 100 cases.
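The HF slope can be translated into an expected difference between a low-volume and a high-volume hospital. A sketch plugging in the coefficients from the HF table above; the case counts of 50 and 500 are illustrative choices, not from the data:

```r
# coefficients from the fitted HF model above
intercept <- 12.0604626
slope     <- -0.0014524

# expected 30-day HF mortality rate (percentage points) at two case loads
rate_small <- intercept + slope * 50
rate_large <- intercept + slope * 500

rate_small - rate_large  # about 0.65 percentage points higher at the low-volume hospital
```

This is of course an association, not a causal estimate: high-volume hospitals may differ in many other ways.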

Another database that has mortality rating is PCIbyhospital. Let’s see if we can find the same relationship there.

#removing observed mortality rate of 0
PCI_mort <- PCIbyhospital %>%
  filter(Procedure == "All PCI", Hospital.Name != "NYS - All Hospitals", Observed.Mortality.Rate >0) %>%
  select(Number.of.Cases, Observed.Mortality.Rate, Risk.Adjusted.Mortality.Rate)

#plot Risk Adjusted rates
PCI_mort %>%
  ggplot(aes(Number.of.Cases, Risk.Adjusted.Mortality.Rate)) +
  geom_point() +
  geom_smooth(span = .5) +
  xlab("Number of PCI Cases") +
  ylab("Risk Adjusted Mortality Rate")

#plot observed rate
PCI_mort %>%
  ggplot(aes(Number.of.Cases, Observed.Mortality.Rate)) +
  geom_point() +
  geom_smooth(span = .5) +
  xlab("Number of PCI Cases") +
  ylab("Observed Mortality Rate")

We see a fairly sharp drop in mortality rate that plateaus once the number of cases surpasses 500. This appears to apply to both the observed and risk adjusted mortality rate. Let’s now fit a regression to number of cases against the two rates.

res_Adj <- PCI_mort %>%
  lm(Risk.Adjusted.Mortality.Rate ~ Number.of.Cases, data = .) %>%
  tidy(conf.int = TRUE)

res_Obs <- PCI_mort %>%
  lm(Observed.Mortality.Rate ~ Number.of.Cases, data = .) %>%
  tidy(conf.int = TRUE)

kable(res_Adj)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 1.1268476 0.0511208 22.042825 0.0000000 1.025939 1.2277568
Number.of.Cases -0.0001226 0.0000428 -2.864481 0.0047006 -0.000207 -0.0000381
kable(res_Obs)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 1.4405224 0.0839115 17.167165 0.00e+00 1.2748866 1.6061581
Number.of.Cases -0.0003081 0.0000702 -4.386258 2.01e-05 -0.0004467 -0.0001694

We see a similar relationship, but at a smaller magnitude: roughly a 0.01 to 0.04 percentage-point decrease in mortality rate per additional 100 cases seen.

Final Model for HCAHPS Star Rating

Now that we have explored the other datasets, we will start examining the relationship between these new metrics and HCAHPS star rating to see if we can improve our model.

Cost and Mortality

First, we will examine cost. Cost is notoriously nebulous and difficult to understand in healthcare. It is nearly impossible to get a quote on the cost of a procedure before going to the hospital, and patients often never see the bill afterward. So we wondered what effect, if any, cost would have on patient satisfaction. Are patients less likely to be satisfied with an expensive hospital, or are the care and overall experience better at more expensive hospitals, making high ratings more likely?

#add cost metrics
cost <- paymentandvalue %>%
  select(c(Provider.ID, Hospital.name, Payment.measure.ID, Denominator, Payment)) %>%
  right_join(HCAHPS, by = "Provider.ID")

#plot our results by case
cost %>% 
  filter(Payment.measure.ID == "PAYM_30_AMI") %>%
  ggplot(aes(H_STAR_RATING, Payment)) +
  geom_boxplot() +
  ggtitle("Payment for AMI vs. Star Rating") +
  xlab("Star Rating")

cost %>% 
  filter(Payment.measure.ID == "PAYM_30_HF") %>%
  ggplot(aes(H_STAR_RATING, Payment)) +
  geom_boxplot() +
  ggtitle("Payment for Heart Failure vs Star Rating") +
  xlab("Star Rating")

cost %>% 
  filter(Payment.measure.ID == "PAYM_30_PN") %>%
  ggplot(aes(H_STAR_RATING, Payment)) +
  geom_boxplot() +
  ggtitle("Payment for Pneumonia vs Star Rating") +
  xlab("Star Rating")

Interestingly, cost does not appear to have a consistent relationship with star rating. Only the payment for pneumonia treatment appears to decline as star rating increases.

What about mortality rates?

#select the columns we want relating to mortality
mort <- complicationshospital %>%
  select(c(Provider.ID, Hospital.Name, Measure.ID, Denominator, Score)) %>%
  spread(Measure.ID, Score)
  
mort[mort == "Not Available"] = NA

#join with our HCAHPS dataset   
mort <- HCAHPS %>%
  right_join(mort, by = "Provider.ID")

#plot our results by case
mort %>% 
  select(H_STAR_RATING, MORT_30_AMI) %>%
  unique() %>%
  ggplot(aes(H_STAR_RATING, MORT_30_AMI)) +
  geom_boxplot() +
  ggtitle("AMI Mortality vs Star Rating") +
  xlab("Star Rating")
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).

mort %>% 
  select(H_STAR_RATING, MORT_30_HF) %>%
  ggplot(aes(H_STAR_RATING, MORT_30_HF)) +
  geom_boxplot() +
  ggtitle("Heart Failure Mortality vs Star Rating") +
  xlab("Star Rating")
## Warning: Removed 2549 rows containing non-finite values (stat_boxplot).

mort %>% 
  select(H_STAR_RATING, MORT_30_STK) %>%
  ggplot(aes(H_STAR_RATING, MORT_30_STK)) +
  geom_boxplot() +
  ggtitle("Stroke Mortality vs Star Rating") +
  xlab("Star Rating")
## Warning: Removed 2578 rows containing non-finite values (stat_boxplot).

mort %>% 
  select(H_STAR_RATING, MORT_30_PN) %>%
  ggplot(aes(H_STAR_RATING, MORT_30_PN)) +
  geom_boxplot() +
  ggtitle("Pneumonia Mortality vs Star Rating") +
  xlab("Star Rating")
## Warning: Removed 2544 rows containing non-finite values (stat_boxplot).

Much like cost, mortality does not appear to have a consistent relationship with star rating. AMI mortality shows only a slight decline as star rating increases. Surprisingly, heart failure mortality appears to actually increase with higher star ratings.

How do mortality rates from PCI compare?

#merge mortality info from PCI
PCIbyhospital <- PCIbyhospital %>%
  left_join(ID_key, by = "Facility.Id")

combined_data <- PCIbyhospital %>% 
  select(Provider.ID, Number.of.Cases, Risk.Adjusted.Mortality.Rate) %>%
  na.omit()

combined_data <- mort %>%
  select(Provider.ID, H_STAR_RATING) %>%
  right_join(combined_data, by = "Provider.ID") 

combined_data %>%
  ggplot(aes(H_STAR_RATING, Risk.Adjusted.Mortality.Rate)) +
  geom_boxplot() +
  ggtitle("Risk Adjusted Mortality vs. Star Rating") +
  xlab("Star Rating")

Based on the above boxplots, there is no real difference in mortality rate between the star ratings, though there is a slight increase in the rate at 4-star hospitals. This echoes our earlier finding for heart failure.

Finally, given that we saw some weak relationship between number of cases seen and mortality, let us take a look at its relationship to star rating.

#we will use number of cases seen from the complications table
cost %>%
  filter(Payment.measure.ID == "PAYM_30_PN") %>%
  ggplot(aes(H_STAR_RATING, Denominator)) +
  geom_boxplot() +
  ggtitle("Number of Cases of Pneumonia vs Star Rating") +
  xlab("Star Rating")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

cost %>%
  filter(Payment.measure.ID == "PAYM_30_AMI") %>%
  ggplot(aes(H_STAR_RATING, Denominator)) +
  geom_boxplot() +
  ggtitle("Number of Cases of AMI vs Star Rating") +
  xlab("Star Rating")
## Warning: Removed 19 rows containing non-finite values (stat_boxplot).

cost %>%
  filter(Payment.measure.ID == "PAYM_30_HF") %>%
  ggplot(aes(H_STAR_RATING, Denominator)) +
  geom_boxplot() +
  ggtitle("Number of Cases of HF vs Star Rating") +
  xlab("Star Rating")
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).

There does not appear to be a significant relationship between cases seen and star rating. While one-star hospitals look like they see fewer cases of pneumonia and AMI, no such pattern is visible for heart failure.

Other predictors

Readmission

We hypothesized that hospitals with higher HCAHPS scores would have lower 30 day readmission rates.

We analyzed the rate of patients readmitted within 30 days of discharge for a complaint related to their original admission, using the COPD readmission measure (READM-30-COPD-HRRP) as a representative condition:

readmissionInfection <- filter(readmission, Measure.Name=="READM-30-COPD-HRRP") %>% 
  select(Hospital.Name, Provider.Number, Number.of.Readmissions, Number.of.Discharges)

#convert counts from factor to numeric, referencing columns by name
readmissionInfection$Number.of.Readmissions <- as.numeric(as.character(readmissionInfection$Number.of.Readmissions))
readmissionInfection$Number.of.Discharges <- as.numeric(as.character(readmissionInfection$Number.of.Discharges))
readmissionInfection <- na.omit(readmissionInfection)

readmissionInfection <- mutate(readmissionInfection, readmissionRate= Number.of.Readmissions/Number.of.Discharges)

readmitHCAHPS <- left_join(HCAHPS, readmissionInfection, by= c("Provider.ID" = "Provider.Number")) %>%
  select(-c(Hospital.Name))

#plot readmission rate by overall star rating
readmitHCAHPS %>%
  ggplot(aes(H_STAR_RATING, readmissionRate)) + 
  geom_boxplot() + 
  xlab("Star Rating") + 
  ylab("Readmission Rate")

Indeed, there seems to be a trend toward lower 30-day readmission rates at hospitals with higher HCAHPS scores.
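One way to go beyond eyeballing the boxplots is to compare the median readmission rate within each star group. A minimal sketch with made-up rates (hypothetical numbers, not the project data):

```r
# Hypothetical readmission rates and star ratings, for illustration only
rate  <- c(0.25, 0.24, 0.22, 0.20, 0.19, 0.18)
stars <- factor(c(3, 3, 4, 4, 5, 5))

# Median readmission rate within each star group
tapply(rate, stars, median)
#>     3     4     5 
#> 0.245 0.210 0.185
```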

If this theory holds, we would expect hospitals with lower readmission rates to score highly on discharge instructions and communication about medicines:

readmitHCAHPS %>%
  ggplot(aes(H_COMP_5_LINEAR_SCORE, readmissionRate)) + 
  geom_point() + 
  geom_smooth() +
  xlab("Discharge Instructions Score") + 
  ylab("Readmission Rate") + 
  ggtitle("Discharge Information")

readmitHCAHPS %>%
  ggplot(aes(H_COMP_6_LINEAR_SCORE, readmissionRate)) + 
  geom_point() + 
  geom_smooth(span = .5) + 
  xlab("Communication about Medicine Score") +
  ylab("Readmission Rate") + 
  ggtitle("Communication about Medicine")

Readmission rates trend downward as the discharge-instruction and medication-communication scores increase.

Time

Time spent in the emergency department is increasingly used as an indicator of the overall patient experience. We compared the time it took hospitals to move admitted patients out of the ED to their rooms against the HCAHPS score.

#time in ED
temp <- timely %>%
  filter(Measure.ID=="ED_1b") #Median time from emergency department arrival to emergency department departure for admitted emergency department patients

#add to the readmitHCAHPS
temp <- left_join(temp, readmitHCAHPS, by= "Provider.ID") %>%
  select(Provider.ID, Measure.ID, Score, Sample, readmissionRate, H_STAR_RATING)

#plot ED wait time against readmission rate
ggplot(temp, aes(x = readmissionRate, y = Score)) + 
  geom_point() + 
  geom_smooth() +
  xlab("Readmission Rate") + 
  ylab("ED Wait Times (for admitted patients)")

#plot ED wait time against star rating
ggplot(temp, aes(H_STAR_RATING, Score)) + 
  geom_boxplot() +
  xlab("Star Rating") + 
  ylab("ED Wait Times (for admitted patients)")

It appears that hospitals with lower HCAHPS star ratings tended to have longer ED wait times.

Complications

BloodClotScore <- complicationshospital %>% 
  filter(Measure.Name == "Serious blood clots after surgery")

#convert to numeric; non-numeric entries become NA and are dropped by na.omit below
BloodClotScore$Score <- as.numeric(as.character(BloodClotScore$Score))
BloodClotScore <- BloodClotScore[order(BloodClotScore$Score),]

#rename the Score column by name rather than by position
names(BloodClotScore)[names(BloodClotScore) == "Score"] <- "Clot.Score"

CompsHCAHPS <- left_join(BloodClotScore, HCAHPS, by= "Provider.ID") %>%
  na.omit()

CompsHCAHPS %>%
  ggplot(aes(H_STAR_RATING, Clot.Score)) + 
  geom_boxplot() + 
  xlab("Star Rating") + 
  ylab("Clot Rate") + 
  ggtitle("Clots vs. Star Rating")

Hospitals with different HCAHPS star ratings do not appear to have dramatically different rates of serious complications such as blood clots. It doesn't seem that patients change their evaluation of the care based on complications.

Finally, to round out our examination of various survival metrics, let us also examine infections.

#merge infection info
tmp <- hospitalinfection %>%
  filter(Year == 2014) %>%
  select(Facility.Id, Infections.observed) %>%
  group_by(Facility.Id) %>%
  summarise(Infections = sum(Infections.observed)) %>%
  left_join(ID_key, by = "Facility.Id") %>%
  na.omit()

combined_data <- tmp %>%
  select(Provider.ID, Infections) %>%
  right_join(combined_data, by = "Provider.ID")

combined_data %>%
  ggplot(aes(H_STAR_RATING, Infections)) +
  geom_boxplot() +
  ggtitle("Total Hospital Infections vs. Star Rating") +
  xlab("Star Rating")

Total hospital-acquired infections look slightly lower for 4-star hospitals, but not meaningfully different across the other ratings.

Aggregating the findings

From all of the analysis above, the best candidates for additional predictors of HCAHPS star rating are readmission rate, ED wait time, heart failure mortality, and payment for pneumonia.

#create the final dataset
final_dat <- paymentandvalue %>%
  filter(Payment.measure.ID == "PAYM_30_PN") %>%
  select(Provider.ID, Payment) %>%
  right_join(HCAHPS, by = "Provider.ID")

final_dat <- readmitHCAHPS %>%
  select(c(Provider.ID, readmissionRate)) %>%
  right_join(final_dat, by = "Provider.ID")

final_dat <- mort %>%
  select(c(Provider.ID, MORT_30_HF)) %>%
  right_join(final_dat, by = "Provider.ID")

final_dat <- temp %>%
  select(Provider.ID, Score) %>%
  right_join(final_dat, by = "Provider.ID") %>%
  select(-c(Provider.ID)) %>%
  na.omit()

#let's make the column names friendly
colnames(final_dat) <- c("ER_Wait", "HF_Mort", "Readmission", "Payment", "Star_Rating", "Cleanliness", "NurseComm", "DoctorComm", "StaffResponsiveness", "PainMmgt.", "MedicationComm.", "Discharge_Info", "Care_Transition", "Overall_Rating", "Quietness")

#turn the star rating into a number
final_dat$Star_Rating <- as.numeric(as.character(final_dat$Star_Rating))

#standardize, because the variables are on different scales (star rating is 1-5)
final_dat_dist <- as.matrix(final_dat) %>%
  scale() %>%
  t() %>%
  dist()

final_dat <- subset(final_dat, select=c(Star_Rating,ER_Wait:Quietness))

heatmap.2(as.matrix(final_dat_dist), 
          margins = c(10,10),
          cellnote= round(as.matrix(final_dat_dist),3), 
          notecex=1.0,
          notecol="cyan",
          dendrogram = 'none',
          trace = 'none',
          key = FALSE)

From the heatmap above, we see that nurse communication and medication communication are still the closest to the star rating. Of the new predictors we added, heart failure mortality is the closest to star rating.
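The scale() step above is what keeps the 1-5 star rating from being swamped by scores measured on much larger ranges. A small synthetic illustration (numbers invented for the example):

```r
# A 1-5-style rating and a 0-100-style score that move in lockstep
stars <- c(2, 3, 4, 5)
score <- c(60, 70, 80, 90)

# Raw Euclidean distance between the two variables is huge...
raw_d <- dist(t(cbind(stars, score)))

# ...but after column-wise standardization they become identical z-scores,
# so the distance collapses to zero
scaled_d <- dist(t(scale(cbind(stars, score))))
```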

Now, let's revisit our kNN model.

set.seed(1)
control <- trainControl(method='cv', number=10, p=.8)
res_knn <- train(Star_Rating ~ .,
             data = final_dat,
             method = "knn",
             trControl = control,
             preProcess = c("center","scale"),
             tuneLength = 1,
             tuneGrid=data.frame(k=seq(1,15,1)))

res_knn$results %>% 
  ggplot(aes(k, RMSE, ymin= RMSE - RMSESD, 
             ymax = RMSE + RMSESD)) +
  geom_point() + geom_errorbar() +
  ggtitle("RMSE over k")

res_knn
## k-Nearest Neighbors 
## 
## 140 samples
##  15 predictor
## 
## Pre-processing: centered (15), scaled (15) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 127, 126, 126, 125, 125, 126, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   RMSE SD     Rsquared SD
##    1  0.2390977  0.8875989  0.17706536  0.10081678 
##    2  0.2566128  0.9014948  0.08829337  0.05838738 
##    3  0.2376135  0.9180396  0.08102504  0.05021170 
##    4  0.2434689  0.9233081  0.05831299  0.03096274 
##    5  0.2509484  0.9211828  0.04267224  0.02308849 
##    6  0.2660685  0.9101836  0.04012624  0.03211314 
##    7  0.2724755  0.9082243  0.04299217  0.03096219 
##    8  0.2822545  0.9003718  0.05013693  0.03309965 
##    9  0.2842629  0.9026982  0.04819130  0.03070106 
##   10  0.2916938  0.8965214  0.04790772  0.03904921 
##   11  0.2939657  0.8993798  0.04257018  0.03777148 
##   12  0.3001591  0.8966483  0.03976504  0.03380846 
##   13  0.3049867  0.8911890  0.04359609  0.03863477 
##   14  0.3104284  0.8903422  0.04249140  0.03634086 
##   15  0.3114241  0.8893125  0.04138787  0.03372770 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was k = 3.

It looks like the best k value is 3. Let's run our prediction model 10 times and see whether our average accuracy has improved.

#evaluate the k = 3 model on 10 random held-out splits
#note: res_knn was tuned on the full dataset, so this estimate is optimistic

knnNew_10 <- sapply(1:10, function(x) {
  inTrain <- createDataPartition(y = final_dat$Star_Rating, p = 0.8)
  test_set <- slice(final_dat, -inTrain$Resample1)
  
  #predict on the held-out rows and round to the nearest star
  pred <- round(predict(res_knn, newdata = test_set))
  truth <- test_set$Star_Rating
  
  #fraction of hospitals whose rounded prediction matches the true rating
  mean(pred == truth)
})

mean(knnNew_10)
## [1] 0.9740741
sd(knnNew_10)
## [1] 0.03049158

Our average accuracy is now around 97%, though because the model was tuned on the full dataset before splitting, this estimate is likely optimistic.
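The accuracy above comes from rounding the continuous kNN predictions to the nearest whole star before comparing them with the truth. A minimal sketch with hypothetical predictions (not the real model output):

```r
# Hypothetical continuous predictions and true star ratings
pred  <- c(2.8, 4.2, 4.9, 3.4)
truth <- c(3,   4,   5,   3)

# Fraction of hospitals whose rounded prediction matches the true rating
acc <- mean(round(pred) == truth)
```

Here every rounded prediction lands on the true rating; on the real held-out data the same calculation averaged about 0.97.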

Finally, out of curiosity, let’s see how well our additional predictors do without the other HCAHPS metrics.

set.seed(1)

final_dat2 <- final_dat %>%
  select(c(Star_Rating:Payment))

control <- trainControl(method='cv', number=10, p=.8)
res_knn <- train(Star_Rating ~ .,
             data = final_dat2,
             method = "knn",
             trControl = control,
             preProcess = c("center","scale"),
             tuneLength = 1,
             tuneGrid=data.frame(k=seq(1,5,1)))

res_knn$results %>% 
  ggplot(aes(k, RMSE, ymin= RMSE - RMSESD, 
             ymax = RMSE + RMSESD)) +
  geom_point() + geom_errorbar() +
  ggtitle("RMSE over k")

res_knn
## k-Nearest Neighbors 
## 
## 140 samples
##   4 predictor
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 127, 126, 126, 125, 125, 126, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE       Rsquared    RMSE SD     Rsquared SD
##   1  0.9649469  0.10071298  0.20408459  0.11372680 
##   2  0.8830800  0.12480524  0.15366005  0.13884282 
##   3  0.8706896  0.08112718  0.10248839  0.12793359 
##   4  0.8354573  0.08279195  0.09522166  0.09401944 
##   5  0.8044505  0.12537656  0.09782312  0.12198666 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was k = 5.

From the RMSE over k results above (roughly 0.8 to 0.9 stars at every k), we can conclude without further modeling that, by themselves, the additional non-HCAHPS predictors are not particularly powerful predictors of star rating.
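For context, an RMSE of roughly 0.8 stars is close to what simply predicting the mean rating would achieve when the ratings are spread out like the synthetic sample below (illustrative numbers, not the project data):

```r
# Hypothetical star ratings with a spread of about 0.87 stars
stars <- c(2, 3, 3, 3, 4, 4, 4, 5)

# RMSE of the constant mean-rating prediction, i.e. the no-information baseline
baseline_rmse <- sqrt(mean((stars - mean(stars))^2))  # about 0.87
```

A model whose cross-validated RMSE sits at this baseline is adding essentially no predictive information.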

Conclusion